Abstract

We study the problem of dynamic spectrum sensing and access in cognitive radio systems as a
partially observed Markov decision process (POMDP). A group of cognitive users cooperatively tries
to exploit vacancies in primary (licensed) channels whose occupancies follow a Markovian evolution.
We first consider the scenario where the cognitive users have perfect knowledge of the distribution
of the signals they receive from the primary users. For this problem, we obtain a greedy channel
selection and access policy that maximizes the instantaneous reward, while satisfying a constraint on
the probability of interfering with licensed transmissions. We also derive an analytical universal upper
bound on the performance of the optimal policy. Through simulation, we show that our scheme achieves
good performance relative to the upper bound and improved performance relative to an existing scheme.

We then consider the more practical scenario where the exact distribution of the signal from the
primary is unknown. We assume a parametric model for the distribution and develop an algorithm that
can learn the true distribution, still guaranteeing the constraint on the interference probability. We show
that this algorithm outperforms the naive design that assumes a worst case value for the parameter. We
also provide a proof for the convergence of the learning algorithm.

I. Introduction

Cognitive radios that exploit vacancies in the licensed spectrum have been proposed as a 
solution to the ever-increasing demand for radio spectrum. The idea is to sense times when 
a specific licensed band is not used at a particular place and use this band for unlicensed 
transmissions without causing interference to the licensed user (referred to as the 'primary'). An 
important part of designing such systems is to develop an efficient channel selection policy. The 
cognitive radio (also called the 'secondary user') needs to adopt the best strategy for selecting 
channels for sensing and access. The sensing and access policies should jointly ensure that the 
probability of interfering with the primary's transmission meets a given constraint. 

In the first part of this paper, we consider the design of such a joint sensing and access policy, 
assuming a Markovian model for the primary spectrum usage on the channels being monitored. 
The secondary users use the observations made in each slot to track the probability of occupancy 
of the different channels. We obtain a suboptimal solution to the resultant POMDP problem. 

In the second part of the paper, we propose and study a more practical problem that arises 
when the secondary users are not aware of the exact distribution of the signals that they receive 
from the primary transmitters. We develop an algorithm that learns these unknown statistics 
and show that this scheme gives improved performance over the naive scheme that assumes a 
worst-case value for the unknown distribution. 

A. Contribution 

When the statistics of the signals from the primary are known, we show that, under our 
formulation, the dynamic spectrum access problem with a group of cooperating secondary users 
is equivalent in structure to a single user problem. We also obtain a new analytical upper bound 
on the expected reward under the optimal scheme. Our suboptimal solution to the POMDP is 
shown via simulations to yield a performance that is close to the upper bound and better than 
that under an existing scheme. 

The main contribution of this paper is the formulation and solution of the problem studied
in the second part involving unknown observation statistics. We show that unknown statistics of
the primary signals can be learned and provide an algorithm that learns these statistics online
and maximizes the expected reward while still satisfying a constraint on the interference probability.

B. Related Work 

In most of the existing schemes [1], [2] in the literature on dynamic spectrum access for
cognitive radios, the authors assume that every time a secondary user senses a primary channel,
it can determine whether or not the channel is occupied by the primary. A different scheme
was proposed in [3] and [4] where the authors assume that the secondary transmitter receives
error-free ACK signals from the secondary's receivers whenever their transmission is successful.
The secondary users use these ACK signals to track the channel states of the primary channels.
We adopt a different strategy in this paper. We assume that every time the secondary users sense
a channel they see a random observation whose distribution depends on the state of the channel.
Our approach is distinctly different from and more realistic than that in [1], [2] since we do
not assume that the secondary users know the primary channel states perfectly through sensing.
We provide a detailed comparison of our approach with that of [3] and [4] after presenting our
solution. In particular, we point out that while using the scheme of [4] there are some practical
difficulties in maintaining synchronization between the secondary transmitter and receiver. Our
scheme provides a way around this difficulty, although we require a dedicated control channel
between the secondary transmitter and receiver.

The problem studied in the second part of this paper that involves learning of unknown
observation statistics is new. However, the idea of combining learning and dynamic spectrum
access was also used in [5] where the authors propose a reinforcement-learning scheme for
learning channel idling probabilities and interference probabilities.

We introduce the basic spectrum sensing and access problem in Section II and describe
our proposed solution in Section III. In Section IV, we elaborate on the problem where the
distributions of the observations are unknown. We present simulation results and comparisons
with some existing schemes in Section V, and our conclusions in Section VI.

II. Problem Statement 

We consider a slotted system where a group of secondary users monitor a set $\mathcal{C}$ of primary
channels. The state of each primary channel switches between 'occupied' and 'unoccupied'
according to the evolution of a Markov chain. The secondary users can cooperatively sense any
one out of the channels in $\mathcal{C}$ in each slot, and can access any one of the $L = |\mathcal{C}|$ channels in the
same slot. In each slot, the secondary users must satisfy a strict constraint on the probability of
interfering with potential primary transmissions on any channel. When the secondary users access
a channel that is free during a given time slot, they receive a reward proportional to the bandwidth
of the channel that they access. The objective of the secondary users is to select the channels for
sensing and access in each slot in such a way that their total expected reward accrued over all
slots is maximized subject to the constraint on interfering with potential primary transmissions
every time they access a channel.¹ Since the secondary users do not have explicit knowledge
of the states of the channels, the resultant problem is a constrained partially observable Markov
decision process (POMDP) problem.

We assume that all channels in $\mathcal{C}$ have equal bandwidth B, and are statistically identical and
independent in terms of primary usage. The occupancy of each channel follows a stationary
Markov chain. The state of channel a in any time slot k is represented by variable $S_a(k)$ and
could be either 1 or 0, where state 0 corresponds to the channel being free for secondary access
and 1 corresponds to the channel being occupied by some primary user.

The secondary system includes a decision center that has access to all the observations made
by the cooperating secondary users.² The observations are transmitted to the decision center
over a dedicated control channel. The same dedicated channel can also be used to maintain
synchronization between the secondary transmitter and secondary receiver so that the receiver
can tune to the correct channel to receive transmissions from the transmitter. The sensing and
access decisions in each slot are made at this decision center. When channel a is sensed in slot
k, we use $\underline{X}_a(k)$ to denote the vector of observations made by the different cooperating users on
channel a in slot k. These observations represent the sampled outputs of the wireless receivers
tuned to channel a that are employed by the cognitive users. The statistics of these observations
are assumed to be time-invariant and distinct for different channel states. The observations on
channel a in slot k have distinct joint probability density functions $f_0$ and $f_1$ when $S_a(k) = 0$
and $S_a(k) = 1$, respectively. The collection of all observations up to slot k is denoted by $\underline{X}^k$, and
the collection of observations on channel a up to slot k is denoted by $\underline{X}_a^k$. The channel sensed
in slot k is denoted by $u_k$, the sequence of channels sensed up to slot k is denoted by $u^k$, and
the set of time slots up to slot k when channel a was sensed is denoted by $K_a^k$. The decision
to access channel a in slot k is denoted by a binary variable $\delta_a(k)$, which takes value 1 when
channel a is accessed in slot k, and 0 otherwise.

¹ We do not consider scheduling policies in this paper and assume that the secondary users have some predetermined scheduling
policy to decide which user accesses the primary channel every time they determine that a channel is free for access.

² The scheme proposed in this paper and the analyses presented in this paper are valid even if the cooperating secondary users
transmit quantized versions of their observations to the fusion center. Minor changes are required to account for the discrete
nature of the observations.

Whenever the secondary users access a free channel in some time slot k, they get a reward B
equal to the bandwidth of each channel in $\mathcal{C}$. The secondary users should satisfy the following
constraint on the probability of interfering with the primary transmissions in each slot, for a
prescribed level $\zeta$:

$$P(\{\delta_a(k) = 1\} \mid \{S_a(k) = 1\}) \leq \zeta.$$

In order to simplify the structure of the access policy, we also assume that in each slot the
decision to access a channel is made using only the observations made in that slot. Hence it
follows that in each slot k, the secondary users can access only the channel they sense in slot k,
say channel a. Furthermore, the access decision must be based on a binary hypothesis test [6]
between the two possible states of channel a, performed on the observation $\underline{X}_a(k)$. This leads
to an access policy with a structure similar to that established in [4]. The optimal test [6] is to
compare the joint log-likelihood ratio (LLR) $\mathcal{L}(\underline{X}_a(k))$ given by,

$$\mathcal{L}(\underline{X}_a(k)) = \log\left(\frac{f_1(\underline{X}_a(k))}{f_0(\underline{X}_a(k))}\right)$$

to some threshold $\Delta$ that is chosen to satisfy,

$$P(\{\mathcal{L}(\underline{X}_a(k)) < \Delta\} \mid \{S_a(k) = 1\}) = \zeta \tag{1}$$

and the optimal access decision would be to access the sensed channel whenever the threshold
exceeds the joint LLR. Hence

$$\delta_a(k) = \mathcal{I}_{\{\mathcal{L}(\underline{X}_a(k)) < \Delta\}}\, \mathcal{I}_{\{u_k = a\}} \tag{2}$$

and the reward obtained in slot k can be expressed as,

$$\hat{r}_k = B\, \mathcal{I}_{\{S_{u_k}(k) = 0\}}\, \mathcal{I}_{\{\mathcal{L}(\underline{X}_{u_k}(k)) < \Delta\}} \tag{3}$$

where $\mathcal{I}_E$ represents the indicator function of event E. The main advantage of the structure of
the access policy given in (2) is that we can obtain a simple sufficient statistic for the resultant
POMDP without having to keep track of all the past observations, as discussed later. It also
has the added advantage [4] that the secondary users can set the thresholds $\Delta$ to meet the
constraint on the probability of interfering with the primary transmissions without relying on
their knowledge of the Markov statistics.
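
As a rough illustration of how the threshold in (1) and the constant $\epsilon$ in (5) might be computed, the sketch below assumes a scalar Gaussian observation model that is not taken from the paper: $f_0 = N(0, 1)$ and $f_1 = N(\mu, 1)$ with $\mu > 0$, so that the LLR is $\mu x - \mu^2/2$. The function name `access_threshold` and its parameters are hypothetical.

```python
from statistics import NormalDist


def access_threshold(mu, zeta):
    """Return (delta, eps) for the assumed model f0 = N(0, 1), f1 = N(mu, 1), mu > 0.

    delta solves P(LLR < delta | occupied) = zeta, as in (1);
    eps = P(LLR > delta | free), as in (5).
    """
    std = NormalDist()  # standard normal distribution
    # The LLR mu*x - mu^2/2 is increasing in x, so {LLR < delta} = {x < x_th}.
    x_th = mu + std.inv_cdf(zeta)      # P(X < x_th) = zeta when X ~ N(mu, 1)
    delta = mu * x_th - mu ** 2 / 2.0  # map the observation threshold to an LLR threshold
    eps = 1.0 - std.cdf(x_th)          # probability of missing a free channel
    return delta, eps


delta, eps = access_threshold(mu=2.0, zeta=0.01)
print(f"delta = {delta:.4f}, eps = {eps:.4f}")
```

Note that only the observation densities enter this computation; the transition matrix of the Markov chain is not needed, which is the advantage pointed out above.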

Our objective is to generate a policy that makes optimal use of primary spectrum subject to
the interference constraint. We introduce a discount factor $\alpha \in (0, 1)$ and aim to solve the infinite
horizon dynamic program with discounted rewards [7]. That is, we seek the sequence of channels
$\{u_0, u_1, \ldots\}$ such that $\sum_{k=0}^{\infty} \alpha^k E[\hat{r}_k]$ is maximized, where the expectation is performed over
the random observations and channel state realizations. We can show the following relation based
on the assumption of identical channels:

$$\begin{aligned}
E[\hat{r}_k] &= E\left[B\, \mathcal{I}_{\{S_{u_k}(k) = 0\}}\, \mathcal{I}_{\{\mathcal{L}(\underline{X}_{u_k}(k)) < \Delta\}}\right] \\
&= E\left[E\left[B\, \mathcal{I}_{\{S_{u_k}(k) = 0\}}\, \mathcal{I}_{\{\mathcal{L}(\underline{X}_{u_k}(k)) < \Delta\}} \,\middle|\, S_{u_k}(k)\right]\right] \\
&= E\left[B(1 - \epsilon)\, \mathcal{I}_{\{S_{u_k}(k) = 0\}}\right]
\end{aligned} \tag{4}$$

where,

$$\epsilon = P(\{\mathcal{L}(\underline{X}_a(k)) > \Delta\} \mid \{S_a(k) = 0\}). \tag{5}$$

Since all the channels are assumed to be identical and the statistics of the observations are
assumed to be constant over time, $\epsilon$ given by (5) is a constant independent of k. From the
structure of the expected reward in (4) it follows that we can redefine our problem such that the
reward in slot k is now given by,

$$r_k = B(1 - \epsilon)\, \mathcal{I}_{\{S_{u_k}(k) = 0\}} \tag{6}$$

and the optimization problem is equivalent to maximizing $\sum_{k=0}^{\infty} \alpha^k E[r_k]$. Since we know the
structure of the optimal access decisions from (2), the problem of spectrum sensing and access
boils down to choosing the optimal channel to sense in each slot. Whenever the secondary
users sense some channel and make observations with LLR lower than the threshold, they are
free to access that channel. Thus we have converted the constrained POMDP problem into an
unconstrained POMDP problem as was done in [4].



III. Dynamic programming

The state of the system in slot k denoted by

$$\underline{S}(k) = (S_1(k), S_2(k), \ldots, S_L(k))^{\top}$$

is the vector of states of the channels in $\mathcal{C}$ that have independent and identical Markovian
evolutions. The channel to be sensed in slot k is decided in slot $k - 1$ and is given by

$$u_k = \mu_k(I_{k-1})$$

where $\mu_k$ is a deterministic function and $I_k = (\underline{X}^k, u^k)$ represents the net information about
past observations and decisions up to slot k. The reward obtained in slot k is a function of
the state in slot k and $u_k$ as given by (6). We seek the sequence of channels $\{u_0, u_1, \ldots\}$
such that $\sum_{k=0}^{\infty} \alpha^k E[r_k]$ is maximized. It is easily verified that this problem is a standard dynamic
programming problem with imperfect observations. It is known [7] that for such a POMDP
problem, a sufficient statistic at the end of any time slot k is the probability distribution of the
system state $\underline{S}(k)$, conditioned on all the past observations and decisions, given by
$P(\{\underline{S}(k) = s\} \mid I_k)$. Furthermore, since the Markovian evolutions of the different channels are independent
of each other, this conditional probability distribution is equivalently represented by the set of
beliefs about the occupancy states of each channel, i.e., the probability of occupancy of each
channel in slot k, conditioned on all the past observations on channel a and times when channel
a was sensed. We use $p_a(k)$ to represent the belief about channel a at the end of slot k, i.e., $p_a(k)$
is the probability that the state $S_a(k)$ of channel a in slot k is 1, conditioned on all observations
and decisions up to time slot k, which is given by

$$p_a(k) = P(\{S_a(k) = 1\} \mid \underline{X}_a^k, K_a^k).$$

We use $p(k)$ to denote the L x 1 vector representing the beliefs about the channels in $\mathcal{C}$. The
initial values of the belief parameters for all channels are set using the stationary distribution of
the Markov chain. We use P to represent the transition probability matrix for the state transitions
of each channel, with $P(i, j)$ representing the probability that a channel that is in state i in slot
k switches to state j in slot $k + 1$. We define,

$$q_a(k) = P(1, 1)\, p_a(k - 1) + P(0, 1)(1 - p_a(k - 1)). \tag{7}$$



This $q_a(k)$ represents the probability of occupancy of channel a in slot k, conditioned on the
observations up to slot $k - 1$. Using Bayes' rule, the belief values are updated as follows after
the observation in time slot k:

$$p_a(k) = \frac{q_a(k) f_1(\underline{X}_a(k))}{q_a(k) f_1(\underline{X}_a(k)) + (1 - q_a(k)) f_0(\underline{X}_a(k))} \tag{8}$$

when channel a was selected in slot k (i.e., $u_k = a$), and $p_a(k) = q_a(k)$ otherwise. Thus from
(8) we see that updates for the sufficient statistic can be performed using only the joint LLR
of the observations, $\mathcal{L}(\underline{X}_a(k))$, instead of the entire vector of observations. Furthermore, from
(2) we see that the access decisions also depend only on the LLRs. Hence we conclude
that this problem with vector observations is equivalent to one with scalar observations where
the scalars represent the joint LLR of the observations of all the cooperating secondary users.
Therefore, in the rest of this paper, we use a scalar observation model with the observation made
on channel a in slot k represented by $Y_a(k)$. We use $Y^k$ to denote the set of all observations up
to time slot k and $Y_a^k$ to denote the set of all observations on channel a up to slot k.
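
The prediction step (7) and the Bayes update (8) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, `P[i][j]` stands for $P(i, j)$, and the update is written in terms of the LLR alone, using the fact that (8) depends on the observation only through the ratio $f_1/f_0$.

```python
import math


def predict(p_prev, P):
    """Eq. (7): occupancy probability in slot k given the belief at the end of slot k-1."""
    return P[1][1] * p_prev + P[0][1] * (1.0 - p_prev)


def update_belief(p_prev, P, sensed, llr=0.0):
    """Eq. (8) written in terms of the LLR log(f1/f0) of the new observation.

    If the channel was not sensed in this slot, the belief is just the prediction q.
    """
    q = predict(p_prev, P)
    if not sensed:
        return q
    ratio = math.exp(llr)  # f1/f0 evaluated at the observation (moderate LLRs assumed)
    return q * ratio / (q * ratio + (1.0 - q))


P = [[0.9, 0.1], [0.2, 0.8]]  # illustrative transition matrix
p = update_belief(0.5, P, sensed=True, llr=math.log(3.0))
print(f"updated belief: {p:.4f}")
```

Dividing the numerator and denominator of (8) by $f_0$ gives the form used in the code, which makes explicit why the LLR is sufficient for tracking the beliefs.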

Hence the new access decisions are given by

$$\delta_a(k) = \mathcal{I}_{\{\mathcal{L}'(Y_a(k)) < \Delta'\}}\, \mathcal{I}_{\{u_k = a\}} \tag{9}$$

where $\mathcal{L}'(Y_a(k))$ represents the LLR of $Y_a(k)$ and the access threshold $\Delta'$ is chosen to satisfy,

$$P(\{\mathcal{L}'(Y_a(k)) < \Delta'\} \mid \{S_a(k) = 1\}) = \zeta. \tag{10}$$

Similarly the belief updates are performed as in (8) with the evaluations of density functions of
$\underline{X}_a(k)$ replaced with the evaluations of the density functions $f_0'$ and $f_1'$ of $Y_a(k)$:

$$p_a(k) = \frac{q_a(k) f_1'(Y_a(k))}{q_a(k) f_1'(Y_a(k)) + (1 - q_a(k)) f_0'(Y_a(k))} \tag{11}$$

when channel a is sensed in slot k (i.e., $u_k = a$), and $p_a(k) = q_a(k)$ otherwise. We use
$G(p(k - 1), u_k, Y_{u_k}(k))$ to denote the function that returns the value of $p(k)$ given that channel
$u_k$ was sensed in slot k. This function can be calculated using the relations (7) and (11). The
reward obtained in slot k can now be expressed as,

$$r_k = B(1 - \epsilon)\, \mathcal{I}_{\{S_{u_k}(k) = 0\}} \tag{12}$$

where $\epsilon$ is given by

$$\epsilon = P(\{\mathcal{L}'(Y_a(k)) > \Delta'\} \mid \{S_a(k) = 0\}). \tag{13}$$

From the structure of the dynamic program, it can be shown that the optimal solution to this
dynamic program can be obtained by solving the following Bellman equation [7] for the optimal
reward-to-go function:

$$J(p) = \max_{u \in \mathcal{C}} \left[B(1 - \epsilon)(1 - q_u) + \alpha E\left(J(G(p, u, Y_u))\right)\right] \tag{14}$$

where p represents the initial value of the belief vector, i.e., the prior probability of channel
occupancies in slot $-1$, and q is calculated from p as in (7) by,

$$q_a = P(1, 1)\, p_a + P(0, 1)(1 - p_a), \quad a \in \mathcal{C}. \tag{15}$$

The expectation in (14) is performed over the random observation $Y_u$. Since it is not easy to
find the optimal solution to this Bellman equation, we adopt a suboptimal strategy to obtain a
channel selection policy that performs well.

In the rest of the paper we assume that the transition probability matrix P satisfies the following
regularity conditions:

$$\text{Assumption 1}: \quad 0 < P(j, j) < 1, \quad j \in \{0, 1\} \tag{16}$$

$$\text{Assumption 2}: \quad P(0, 0) > P(1, 0) \tag{17}$$

The first assumption ensures that the resultant Markov chain is irreducible and positive recurrent,
while the second assumption ensures that it is more likely for a channel that is free in the current
slot to remain free in the next slot than for a channel that is occupied in the current slot to switch
states and become free in the next slot. While the first assumption is important, the second one
is used only in the derivation of the upper bound on the optimal performance and can easily be
relaxed by separately considering the case where the inequality (17) does not hold.
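
The two regularity conditions, together with the stationary occupancy probability used to initialize the beliefs, can be checked with a few lines of code. The sketch below is illustrative only; the function name is hypothetical, and it relies on the standard closed form for a two-state chain, in which the stationary probability of state 1 is $P(0,1)/(P(0,1) + P(1,0))$.

```python
def stationary_occupancy(P):
    """Check (16) and (17) and return (p_star, assumption_2_holds).

    P[i][j] is the probability of moving from state i to state j,
    with state 1 meaning 'occupied'. p_star is the stationary
    probability that a channel is occupied.
    """
    for j in (0, 1):
        if not 0.0 < P[j][j] < 1.0:
            raise ValueError("Assumption 1 (16) is violated")
    assumption_2 = P[0][0] > P[1][0]  # Eq. (17)
    p_star = P[0][1] / (P[0][1] + P[1][0])
    return p_star, assumption_2


P = [[0.9, 0.1], [0.2, 0.8]]  # illustrative transition matrix
p_star, ok = stationary_occupancy(P)
print(f"p* = {p_star:.4f}, Assumption 2 holds: {ok}")
```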

A. Greedy policy 

A straightforward suboptimal solution to the channel selection problem is the greedy policy,
i.e., the policy of maximizing the expected instantaneous reward in the current time slot. The
expected instantaneous reward obtained by accessing some channel a in a given slot k is given
by $B(1 - \epsilon)(1 - q_a(k))$ where $\epsilon$ is given by (13). Hence the greedy policy is to choose the
channel a such that $1 - q_a(k)$ is the maximum:

$$u_k^{\mathrm{gr}} = \arg\max_{u \in \mathcal{C}} \{1 - q_u(k)\}. \tag{18}$$

In other words, in every slot the greedy policy chooses the channel that is most likely to be free,
conditioned on the past observations. The greedy policy for this problem is in fact equivalent to
the $Q_{MDP}$ policy, which is a standard suboptimal solution to the POMDP problem (see, e.g., [8]).
It is shown in [1] and [2] that under some conditions on P and L, the greedy policy is optimal if
the observation in each slot reveals the underlying state of the channel. Hence it can be argued
that under the same conditions, the greedy policy would also be optimal for our problem at high
SNR.
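
The greedy selection rule (18) amounts to one prediction step per channel followed by an argmax. A minimal sketch, with a hypothetical function name and an illustrative transition matrix, is:

```python
def greedy_channel(beliefs, P):
    """Eq. (18): pick the channel with the largest predicted probability of being free.

    beliefs[a] is p_a(k-1); P[i][j] is the per-channel transition probability.
    Returns the chosen channel index and the list of predicted occupancies q_a(k).
    """
    q = [P[1][1] * p + P[0][1] * (1.0 - p) for p in beliefs]  # Eq. (7) for every channel
    best = min(range(len(q)), key=q.__getitem__)  # smallest q_a(k) = largest 1 - q_a(k)
    return best, q


P = [[0.9, 0.1], [0.2, 0.8]]
channel, q = greedy_channel([0.9, 0.2, 0.5], P)
print(f"sense channel {channel}, predicted occupancies {q}")
```

Ties are broken here in favor of the lowest channel index, which is an arbitrary choice made for the sketch.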

B. An upper bound 

An upper bound on the optimal reward for the POMDP of (14) can be obtained by assuming
more information than the maximum that can be obtained in reality. One such assumption that
can give us a simple upper bound is the $Q_{MDP}$ assumption [8], which is to assume that in all
future slots, the states of all channels become known exactly after making the observation in that
slot. The optimal reward under the $Q_{MDP}$ assumption is a function of the initial belief vector,
i.e., the prior probabilities of occupancy of the channels in slot $-1$. We represent this function
by $J^Q$. In practice, a reasonable choice of initial value of the belief vector is given by the
stationary distribution of the Markov chains. Hence for any solution to the POMDP that uses
this initialization, an upper bound for the optimal reward under the $Q_{MDP}$ assumption is given by
$J^U = J^Q(p^* \mathbf{1})$, where $p^*$ represents the probability that a channel is occupied under the stationary
distribution of the transition probability matrix P, and $\mathbf{1}$ represents an L x 1 vector of all 1's.

The first step involved in evaluating this upper bound is to determine the optimal reward
function under the assumption that all the channel states become known exactly after making
the observation in each slot including the current slot. We call this function $\tilde{J}$. That is, we want
to evaluate $\tilde{J}(x)$ for all binary strings x of length L that represent the $2^L$ possible values of the
vector representing the states of all channels in slot $-1$. The $Q_{MDP}$ assumption implies that the
functions $J^Q$ and $\tilde{J}$ satisfy the following equation:

$$J^Q(z) = \max_{u \in \mathcal{C}} \left\{ B(1 - \epsilon)(1 - q_u) + \sum_{x \in \{0, 1\}^L} \alpha P(\{\underline{S}(0) = x\})\, \tilde{J}(x) \right\} \quad \text{s.t. } p(-1) = z \tag{19}$$

where $p(-1)$ denotes the a priori belief vector about the channel states in slot $-1$ and $q_u$ is
obtained from $p_u(-1)$ just as in (15). Hence the upper bound $J^U = J^Q(p^* \mathbf{1})$ can be easily
evaluated using P once the function $\tilde{J}$ is determined.

Now we describe how one can solve for the function $\tilde{J}$. Under the assumption that the states of
all the channels become known exactly at the time of observation, the optimal channel selected
in any slot k would be a function of the states of the channels in slot $k - 1$. Moreover, the
sensing action in the current slot would not affect the rewards in the future slots. Hence the
optimal policy would be to maximize the expected instantaneous reward, which is achieved by
accessing the channel that is most likely to be free in the current slot. Now under the added
assumption stated in (17) earlier,³ the optimal policy would always select some channel that was
free in the previous time slot, if there is any. If no channel is free in the previous time slot, then
the optimal policy would be to select any one of the channels in $\mathcal{C}$, since all of them are equally
likely to be free in the current slot. Hence the derivation of the optimal total reward for this
problem is straightforward as illustrated below. The total reward for this policy is a function of
the state of the system in the slot preceding the initial slot, i.e., $\underline{S}(-1)$.

$$\begin{aligned}
\tilde{J}(x) &= \max_{u \in \mathcal{C}} E\left[\kappa\, \mathcal{I}_{\{S_u(0) = 0\}} + \alpha \tilde{J}(\underline{S}(0)) \,\middle|\, \{\underline{S}(-1) = x\}\right] \\
&= \begin{cases} \kappa P(0, 0) + \alpha V(x) & \text{if } x \neq \mathbf{1} \\ \kappa P(1, 0) + \alpha V(x) & \text{if } x = \mathbf{1} \end{cases}
\end{aligned}$$

where $V(x) = E[\tilde{J}(\underline{S}(0)) \mid \{\underline{S}(-1) = x\}]$, $\kappa = B(1 - \epsilon)$, and $\mathbf{1}$ is an L x 1 string of all 1's. This
means that we can write

$$\begin{aligned}
\tilde{J}(x) &= \kappa \left\{ P(0, 0) \sum_{k=0}^{\infty} \alpha^k - (P(0, 0) - P(1, 0))\, w(x) \right\} \\
&= \kappa \left( \frac{P(0, 0)}{1 - \alpha} - (P(0, 0) - P(1, 0))\, w(x) \right)
\end{aligned} \tag{20}$$

where

$$w(x) = E\left[ \sum_{M \geq -1 :\, \underline{S}(M) = \mathbf{1}} \alpha^{M+1} \,\middle|\, \{\underline{S}(-1) = x\} \right]$$

is a scalar function of the vector state x. Here the expectation is over the random slots when
the system reaches state $\mathbf{1}$. Now by stationarity we have,

$$w(x) = E\left[ \sum_{M \geq 0 :\, \underline{S}(M) = \mathbf{1}} \alpha^{M} \,\middle|\, \{\underline{S}(0) = x\} \right]. \tag{21}$$

We use $\bar{P}$ to denote the matrix of size $2^L \times 2^L$ representing the transition probability matrix of
the joint Markov process that describes the transitions of the vector of channel states $\underline{S}(k)$. The
$(i, j)$th element of $\bar{P}$ represents the probability that the state of the system switches to y in slot
$k + 1$ given that the state of the system is x in slot k, where x is the L-bit binary representation
of $i - 1$ and y is the L-bit binary representation of $j - 1$. Using a slight abuse of notation we
represent the $(i, j)$th element of $\bar{P}$ as $\bar{P}(x, y)$ itself. Now equation (21) can be solved to obtain,

$$w(x) = \sum_{y} \alpha \bar{P}(x, y)\, w(y) + \mathcal{I}_{\{x = \mathbf{1}\}}. \tag{22}$$

This is a fixed point equation which can be solved to obtain,

$$w = (I - \alpha \bar{P})^{-1} \begin{pmatrix} 0 \\ \vdots \\ 0 \\ 1 \end{pmatrix}_{2^L \times 1} \tag{23}$$

where w is a $2^L \times 1$ vector whose elements are the values of the function $w(x)$ evaluated at the
$2^L$ different possible values of the vector state x of the system in time slot $-1$. Again, the $i$th
element of vector w is $w(x)$ where x is the L-bit binary representation of $i - 1$. Thus $\tilde{J}$ can now
be evaluated by using relation (20) and the expected reward for this problem under the $Q_{MDP}$
assumption can be calculated by evaluating $J^U = J^Q(p^* \mathbf{1})$ via equation (19). This optimal value
yields an analytical upper bound on the optimal reward of the original problem (14).

³ It is easy to see that a minor modification of the derivation of the upper bound works when assumption (17) does not hold.
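
The evaluation of the upper bound can be sketched numerically as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function name and arguments are hypothetical, `P[i][j]` stands for $P(i, j)$, `kappa` stands for $B(1 - \epsilon)$, and the fixed point equation (22) is solved by simple iteration (a contraction because the discount factor is below 1) instead of the matrix inverse in (23).

```python
from itertools import product


def qmdp_upper_bound(P, L, alpha, kappa, tol=1e-12):
    """Evaluate the bound J^Q(p* 1) using (19), (20) and (22).

    Assumes the regularity conditions (16) and (17) hold.
    Returns the bound, the reward-to-go table from (20), and w from (22).
    """
    states = list(product((0, 1), repeat=L))  # all 2^L joint channel states
    ones = (1,) * L                           # the state in which every channel is occupied

    def joint(x, y):
        # Channels evolve independently, so the joint transition probability factorizes.
        prob = 1.0
        for xa, ya in zip(x, y):
            prob *= P[xa][ya]
        return prob

    # Fixed-point iteration on (22).
    w = {x: 0.0 for x in states}
    while True:
        w_new = {
            x: alpha * sum(joint(x, y) * w[y] for y in states) + (1.0 if x == ones else 0.0)
            for x in states
        }
        gap = max(abs(w_new[x] - w[x]) for x in states)
        w = w_new
        if gap < tol:
            break

    # Eq. (20): reward-to-go when all states are revealed after each observation.
    J_tilde = {
        x: kappa * (P[0][0] / (1.0 - alpha) - (P[0][0] - P[1][0]) * w[x])
        for x in states
    }

    # Stationary prior: every channel is occupied with probability p*.
    p_star = P[0][1] / (P[0][1] + P[1][0])
    q = P[1][1] * p_star + P[0][1] * (1.0 - p_star)  # Eq. (15); equals p_star

    def prob_state(x):
        prob = 1.0
        for xa in x:
            prob *= q if xa == 1 else 1.0 - q
        return prob

    # Eq. (19): all channels share the same q here, so the max over u is trivial.
    J_U = kappa * (1.0 - q) + alpha * sum(prob_state(x) * J_tilde[x] for x in states)
    return J_U, J_tilde, w


P = [[0.9, 0.1], [0.2, 0.8]]  # illustrative transition matrix
bound, _, _ = qmdp_upper_bound(P, L=2, alpha=0.9, kappa=1.0)
print(f"upper bound for L = 2: {bound:.4f}")
```

The cost grows as $4^L$ per iteration because the joint chain has $2^L$ states, so this direct evaluation is practical only for a small number of channels.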



C. Comparison with [4] for single user problem

Although we have studied a spectrum access scheme for a cooperative cognitive radio network,
it can also be employed by a single cognitive user. Under this setting, our approach to the
spectrum access problem described earlier in this section is similar to that considered in [4] and
[3] in that sensing does not reveal the true channel states but only a random variable whose
distribution depends on the current state of the sensed channel. As a result, the structure of our
optimal access policy and the sufficient statistic are similar to those in [4]. In this section we
compare the two schemes.

The main difference between our formulation and that in [4] is that in our formulation the
secondary users use the primary signal received on the channel to track the channel occupancies,
while in [4] they use the ACK signals exchanged between the secondary transmitter and receiver.
Under the scheme of [4], in each slot, the secondary receiver transmits an ACK signal upon
successful reception of a transmission from the secondary transmitter. The belief updates are then
performed using the single bit of information provided by the presence or absence of the ACK
signal. The approach of [4] was motivated by the fact that, under that scheme, the secondary
receiver knows in advance the channel on which to expect potential transmissions from the
secondary transmitter in each slot, thus obviating the need for control channels for
synchronization. However, such synchronization between the transmitter and receiver is not reliable in
the presence of interfering terminals that are hidden [9] from either the receiver or transmitter,
because the ACK signals will no longer be error-free. In this regard, we believe that a more
practical solution to this problem would be to set aside a dedicated control channel of low capacity
for the purpose of reliably maintaining synchronization, and use the observations on the primary
channel for tracking the channel occupancies. In addition to guaranteeing synchronization, our
scheme provides some improvement in utilizing transmission opportunities over the ACK-based
scheme, as we show in Section V-A.

Another difference between our formulation and that in [4] is that we assume that the statistics
of channel occupancies are independent and identical while [4] considers the more general case
of correlated and non-identical channels. However, the scheme we proposed in Section III can be
easily modified to handle this case, with added complexity. The sufficient statistic would now be
the a posteriori distribution of $\underline{S}(k)$, the vector of states of all channels, and the access thresholds
on different channels would be non-identical and depend on the statistics of the observations on the
respective channels. We avoid elaborating on this more general setting to keep the presentation
simple.

IV. The case of unknown distributions 

In practice, the secondary users are typically unaware of the primary's signal characteristics
and the channel realization from the primary [10]. Hence cognitive radio systems have to rely on
some form of non-coherent detection such as energy detection while sensing the primary signals.
Furthermore, even while employing non-coherent detectors, the secondary users are also unaware
of their locations relative to the primary and hence are not aware of the shadowing and path
loss from the primary to the secondary. Hence it is not reasonable to assume that the secondary
users know the exact distributions of the observations under the primary-present hypothesis,
although it can be assumed that the distribution of the observations under the primary-absent
hypothesis is known exactly. This scenario can be modeled by using a parametric description
for the distributions of the received signal under the primary-present hypothesis. We denote the
density functions of the observations under the two possible hypotheses as,

$$\begin{aligned}
S_a(k) = 0 &: \; Y_a(k) \sim f_{\theta_0}, \\
S_a(k) = 1 &: \; Y_a(k) \sim f_{\theta_a},
\end{aligned} \quad \text{where } \theta_a \in \Theta, \; \forall a \in \{1, 2, \ldots, L\} \tag{24}$$

where the parameters $\{\theta_a\}$ are unknown for all channels a, and $\theta_0$ is known. We use $\mathcal{L}_\theta(\cdot)$ to
denote the log-likelihood function under $f_\theta$ defined by,

$$\mathcal{L}_\theta(x) \triangleq \log\left(\frac{f_\theta(x)}{f_{\theta_0}(x)}\right), \quad x \in \mathbb{R}, \; \theta \in \Theta. \tag{25}$$

In this section, we study two possible approaches for dealing with such a scenario, while
restricting to greedy policies for channel selection. For ease of illustration, in this section we
consider a secondary system comprised of a single user, although the same ideas can also be
applied for a system with multiple cooperating users.

A. Worst-case design for non-random $\theta_a$

A close examination of Section III reveals two specific uses for the density function of the
observations under the $S_a(k) = 1$ hypothesis. The knowledge of this density was of crucial
importance in setting the access threshold in (10) to meet the constraint on the probability of
interference. The other place where this density was used was in updating the belief probabilities
in (11). When the parameters $\{\theta_a\}$ are non-random and unknown, we have to guarantee the
constraint on the interference probability for all possible realizations of $\theta_a$. The optimal access
decision would thus be given by,

$$\delta_a(k) = \mathcal{I}_{\{u_k = a\}} \prod_{\theta \in \Theta} \mathcal{I}_{\{\mathcal{L}_\theta(Y_a(k)) < \tau_\theta\}} \tag{26}$$

where $\tau_\theta$ satisfies,

$$P(\{\mathcal{L}_\theta(Y_a(k)) < \tau_\theta\} \mid \{S_a(k) = 1, \theta_a = \theta\}) = \zeta. \tag{27}$$

The other concern that we need to address in this approach is: what distribution do we use for
the observations under $S_a(k) = 1$ in order to perform the updates in (11)? An intelligent solution
is possible provided the densities described in (24) satisfy the condition that there is a $\theta^* \in \Theta$
such that for all $\theta \in \Theta$ and for all $\tau \in \mathbb{R}$ the following inequality holds:

$$P(\{\mathcal{L}_{\theta^*}(Y_a(k)) > \tau\} \mid \{S_a(k) = 1, \theta_a = \theta\}) \geq P(\{\mathcal{L}_{\theta^*}(Y_a(k)) > \tau\} \mid \{S_a(k) = 1, \theta_a = \theta^*\}). \tag{28}$$

The condition (28) is satisfied by several parameterized densities including an important practical
example discussed later. Under condition (28), a good suboptimal solution to the channel selection
problem would be to run the greedy policy for channel selection using $f_{\theta^*}$ for the density under
$S_a(k) = 1$ while performing the updates of the channel beliefs in (11). This is a consequence
of the following lemma.

Lemma 4.1: Assume condition (28) holds. Suppose $f_{\theta^*}$ is used in place of $f_1'$ for the distribution
of the observations under $S_a(k) = 1$ while performing belief updates in (11). Then,
(i) For all $\theta \in \Theta$ and for all $\beta, p \in [0, 1]$,

$$P_\theta(\{p_a(k) > \beta\} \mid \{S_a(k) = 1, p_a(k - 1) = p\}) \geq P_{\theta^*}(\{p_a(k) > \beta\} \mid \{S_a(k) = 1, p_a(k - 1) = p\}) \tag{29}$$

where $P_\theta$ represents the probability measure when $\theta_a = \theta$.
(ii) Conditioned on $\{S_a(k) = 0\}$, the distribution of $p_a(k)$ given any value for $p_a(k - 1)$ is
identical for all possible values of $\theta_a$.

Proof: (i) Clearly (29) holds with equality when channel a is not sensed in slot k (i.e.,
$u_k \neq a$). When $u_k = a$, it is easy to see that the new belief given by (11) is a monotonically
increasing function of the log-likelihood function, $\mathcal{L}_{\theta^*}(Y_a(k))$. Hence (29) follows from condition
(28).

(ii) This is obvious since the randomness in $p_a(k)$ under $\{S_a(k) = 0\}$ is solely due to the
observation $Y_a(k)$ whose distribution $f_{\theta_0}$ does not depend on $\theta_a$. ■

Clearly, updating using $f_{\theta^*}$ in (11) is optimal if $\theta_a = \theta^*$. When
$\theta_a \neq \theta^*$, the tracking of beliefs is guaranteed to be at least as accurate, in the
sense described in Lemma 4.1. Hence, under condition (28), a good suboptimal solution to the channel
selection problem would be to run the greedy policy for channel selection using $f_{\theta^*}$ for the
density under $S_a(k) = 1$ while performing the updates of the channel beliefs in (11). Furthermore,
it is known [11] that, under condition (28), the set of likelihood ratio tests in the access decision
of (26) can be replaced with a single likelihood ratio test under the worst-case parameter
$\theta^*$, given by

$$\delta_a(k) = \mathcal{I}_{\{u_k = a\}} \mathcal{I}_{\{\mathcal{L}_{\theta^*}(Y_a(k)) < \tau_{\theta^*}\}}. \tag{30}$$

The structure of the access decision given in (30) and the conclusion from Lemma 4.1 suggest
that $\theta^*$ is a worst-case value of the parameter $\theta_a$. Hence the strategy of designing the
sensing and access policies assuming this worst possible value of the parameter is optimal in the
following min-max sense: the average reward when the true value of $\theta_a \neq \theta^*$ is expected
to be no smaller than that obtained when $\theta_a = \theta^*$, since the tracking of beliefs is worst
when $\theta_a = \theta^*$, as shown in Lemma 4.1. This intuitive reasoning is seen to hold in the
simulation results in Section V-B.
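As a numerical sanity check of condition (28), the following sketch assumes the Gaussian mean-shift family used later in (40), where the observation is $\mathcal{N}(\theta, \sigma^2)$ under $S_a(k) = 1$ and $\mathcal{N}(0, \sigma^2)$ under $S_a(k) = 0$. The function names, the parameter set and the grid of thresholds are illustrative choices, not part of the original development, and a grid check is of course not a proof.

```python
from statistics import NormalDist

def llr_tail_prob(tau, theta_star, theta, sigma=1.0):
    """P(L_{theta*}(Y) > tau | S=1, theta_a=theta) for Y ~ N(theta, sigma^2).

    Assumes the Gaussian mean-shift family with theta_star > 0, for which
    L_{theta*}(y) = (theta_star*y - theta_star**2/2) / sigma**2 is increasing in y.
    """
    # Invert the LLR: L_{theta*}(y) > tau  <=>  y > y_cut.
    y_cut = (tau * sigma ** 2 + 0.5 * theta_star ** 2) / theta_star
    return 1.0 - NormalDist(mu=theta, sigma=sigma).cdf(y_cut)

def condition_28_holds(thetas, theta_star, taus, sigma=1.0, tol=1e-12):
    """Check inequality (28) on a finite grid of thresholds."""
    return all(
        llr_tail_prob(t, theta_star, th, sigma)
        >= llr_tail_prob(t, theta_star, theta_star, sigma) - tol
        for th in thetas
        for t in taus
    )

thetas = [0.5, 1.0, 2.0]                      # hypothetical parameter set
taus = [x / 10.0 for x in range(-50, 51)]     # grid of LLR thresholds
print(condition_28_holds(thetas, min(thetas), taus))  # candidate theta* = min(Theta)
print(condition_28_holds(thetas, max(thetas), taus))  # candidate theta* = max(Theta)
```

Under these assumptions only the smallest mean passes the check, which matches the choice $\theta^* = \min \Theta$ made for this family in the simulations.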

B. Modeling $\theta_a$ as random

In Section V-B we show through simulations that the worst-case approach of the previous
section leads to a severe decline in performance relative to the scenario where the distribution
parameters in (24) are known accurately. In practice it may be possible to learn the value of
these parameters online. In order to learn the parameters $\{\theta_a\}$ we need a statistical
model for these parameters and a reliable statistical model for the channel state process. In this
section we model the parameters $\{\theta_a\}$ as random variables, which are i.i.d. across the
channels and independent of the Markov process as well as the noise process. In order to assure the
convergence of our learning algorithm, we also assume that the cardinality of set $\Theta$ is finite
and let $|\Theta| = N$. Let $\{\mu_i\}_{i=1}^{N}$ denote the elements of set $\Theta$. The prior
distribution of the parameters $\{\theta_a\}$ is known to the secondary users. The beliefs of the
different channels no longer form a sufficient statistic for this problem. Instead, we keep track of
the following set of a posteriori probabilities, which we refer to as joint beliefs:

$$\{P(\{(\theta_a, S_a(k)) = (\mu_i, j)\} \mid I_k) : \forall i, j, a\}. \tag{31}$$

Footnote: We do discuss the scenario when $\Theta$ is a compact set in the example considered in
Section V-B.

Since we assume that the parameters $\{\theta_a\}$ take values in a finite set, we can keep track of
these joint beliefs just as we kept track of the beliefs of the states of different channels in
Section III. For the initial values of these joint beliefs we use the product distribution of the
stationary distribution of the Markov chain and the prior distribution on the parameters
$\{\theta_a\}$. We store these joint beliefs at the end of slot $k$ in an $L \times N \times 2$ array
$Q(k)$ with elements given by

$$Q_{a,i,j}(k) = P(\{(\theta_a, S_a(k)) = (\mu_i, j)\} \mid Y^k, K^k). \tag{32}$$

The entries of the array $Q(k)$ corresponding to channel $a$ represent the joint a posteriori
probability distribution of the parameter $\theta_a$ and the state of channel $a$ in slot $k$
conditioned on the information available up to slot $k$, which we called $I_k$. Now define

$$H_{a,i,j}(k) = \sum_{\ell \in \{0,1\}} P(\ell, j)\, Q_{a,i,\ell}(k-1),$$

where $P(\ell, j)$ denotes the probability that a channel moves from state $\ell$ to state $j$ in one
slot. Again, the values of the array $H(k)$ represent the a posteriori probability distributions about
the parameters $\{\theta_a\}$ and the channel states in slot $k$ conditioned on $I_{k-1}$, the
information up to slot $k-1$. The update equations for the joint beliefs can now be written as follows:

$$Q_{a,i,j}(k) = \lambda \begin{cases} H_{a,i,0}(k)\, f_0(Y_a(k)) & \text{if } j = 0 \\ H_{a,i,1}(k)\, f_{\mu_i}(Y_a(k)) & \text{if } j = 1 \end{cases}$$

when channel $a$ was sensed in slot $k$ (i.e., $u_k = a$), and $Q_{a,i,j}(k) = H_{a,i,j}(k)$ otherwise.
Here $\lambda$ is just a normalizing factor.
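The prediction and update steps for the joint beliefs of a single channel can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes Gaussian observation densities (as in the later simulation example), a row-stochastic transition matrix with `P[l][j]` the probability of moving from state `l` to state `j`, and hypothetical function names `predict` and `update`.

```python
from statistics import NormalDist

def predict(Q_prev, P):
    """One-step prediction for one channel: H[i][j] = sum_l P[l][j] * Q_prev[i][l].

    Q_prev[i][l] is the joint belief of (theta = mu_i, state = l) after the previous slot.
    P[l][j] is assumed to be the probability of moving from state l to state j.
    """
    return [[sum(P[l][j] * row[l] for l in (0, 1)) for j in (0, 1)] for row in Q_prev]

def update(H, y, mus, sigma=1.0):
    """Bayes update of one channel's joint beliefs after sensing it and observing y.

    Assumes Gaussian densities: N(0, sigma^2) when idle, N(mu_i, sigma^2) when occupied.
    """
    f0 = NormalDist(0.0, sigma).pdf(y)
    unnorm = [
        [H[i][0] * f0, H[i][1] * NormalDist(mus[i], sigma).pdf(y)]
        for i in range(len(mus))
    ]
    total = sum(sum(row) for row in unnorm)   # equals 1 / lambda
    return [[v / total for v in row] for row in unnorm]

# Illustrative values only.
P = [[0.9, 0.1], [0.2, 0.8]]
mus = [0.5, 2.0]
stationary = [2.0 / 3.0, 1.0 / 3.0]           # stationary distribution of P
Q0 = [[0.5 * stationary[j] for j in (0, 1)] for _ in mus]   # uniform prior on mus
Q1 = update(predict(Q0, P), y=2.0, mus=mus)
posterior_theta = [sum(row) for row in Q1]    # marginal posterior over mus
print(posterior_theta)
```

Summing each row of the updated array over the two channel states gives the marginal posterior of the parameter, which is the quantity the access rule below works with.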

It is shown in the Appendix that, for each channel $a$, the a posteriori probability mass function
of parameter $\theta_a$ conditioned on the information up to slot $k$ converges to a delta function
at the true value of parameter $\theta_a$ as $k \to \infty$, provided we sense channel $a$ frequently
enough. This essentially means that we can learn the value of the actual realization of $\theta_a$ by
just updating the joint beliefs. This observation suggests that we could use this knowledge learned
about the parameters in order to obtain better performance than that obtained under the policy of
Section IV-A. We could, for instance, use the knowledge of the true value of $\theta_a$ to be more
liberal in our access policy than the satisfy-all-constraints approach that we used in Section IV-A
when we did a worst-case design. With this in mind, we propose the following algorithm for choosing
the threshold to be used in each slot for determining whether or not to access the spectrum.

Assume channel $a$ was sensed in slot $k$. We first arrange the elements of set $\Theta$ in increasing
order of the a posteriori probabilities of parameter $\theta_a$. We partition $\Theta$ into two groups,
a 'lower' partition and an 'upper' partition, such that all elements in the lower partition have lower
a posteriori probability values than all elements in the upper partition. The partitioning is done
such that the number of elements in the lower partition is maximized subject to the constraint
that the a posteriori probabilities of the elements in the lower partition add up to a value lower
than $\zeta$. These elements of $\Theta$ can be ignored while designing the access policy since the
sum of their a posteriori probabilities is below the interference constraint. We then design the
access policy such that we meet the interference constraint conditioned on parameter $\theta_a$ taking
any value in the upper partition. The mathematical description of the algorithm is as follows. Define

$$b_a^i(k) = \sum_{j \in \{0,1\}} H_{a,i,j}(k).$$

The vector $(b_a^1(k), b_a^2(k), \ldots, b_a^N(k))^\top$ represents the a posteriori probability mass
function of parameter $\theta_a$ conditioned on $I_{k-1}$, the information available up to slot $k-1$.
Now let $\pi_k : \{1, 2, \ldots, N\} \to \{1, 2, \ldots, N\}$ be a permutation of
$\{1, 2, \ldots, N\}$ such that $\{\mu_{\pi_k(i)}\}_{i=1}^{N}$ are arranged in increasing order of
a posteriori probabilities, i.e.,

$$i \geq j \Rightarrow b_a^{\pi_k(i)}(k) \geq b_a^{\pi_k(j)}(k),$$

and let

$$N_a(k) = \max\Big\{c \leq N : \sum_{i=1}^{c} b_a^{\pi_k(i)}(k) < \zeta\Big\},$$

with $N_a(k) = 0$ if no such $c \geq 1$ exists. Now define the set
$\Theta_a(k) = \{\mu_{\pi_k(i)} : i > N_a(k)\}$. This set is the upper partition mentioned earlier.
The access decision on channel $a$ in slot $k$ is given by

$$\delta_a(k) = \mathcal{I}_{\{u_k = a\}} \prod_{\theta \in \Theta_a(k)} \mathcal{I}_{\{\mathcal{L}_\theta(Y_a(k)) < \tau_\theta\}} \tag{33}$$

where the $\tau_\theta$ satisfy (27). The access policy given above guarantees that

$$P(\{\delta_a(k) = 1\} \mid \{S_a(k) = 1\}, Y^{k-1}, K^{k-1}) \leq \zeta \tag{34}$$

whence the same holds without conditioning on $Y^{k-1}$ and $K^{k-1}$. Hence, the interference
constraint is met on average, averaged over the a posteriori distributions of $\theta_a$. Now it is
shown in the Appendix that the a posteriori probability mass function of parameter $\theta_a$
converges to a delta function at the true value of parameter $\theta_a$ almost surely. Hence the
constraint is asymptotically met even conditioned on $\theta_a$ taking the correct value. This follows
from the fact that, if $\mu_{i^*}$ is the actual realization of the random variable $\theta_a$, and
$b_a^{i^*}(k)$ converges to 1 almost surely, then, for sufficiently large $k$, (33) becomes
$\delta_a(k) = \mathcal{I}_{\{u_k = a\}} \mathcal{I}_{\{\mathcal{L}_{\mu_{i^*}}(Y_a(k)) < \tau_{\mu_{i^*}}\}}$
with probability one, and hence the claim is satisfied.

Footnote on the access decision (33): The access policy obtained via the partitioning scheme is simple
to implement but is not the optimal policy in general. The optimal access decision on channel $a$ in
slot $k$ would be given by a likelihood-ratio test between $f_0$ and the mixture density
$\sum_{\theta \in \Theta} r_\theta(k-1) f_\theta$, where $r_\theta(k-1)$ represents the value of the
posterior distribution of $\theta_a$ after slot $k-1$, evaluated at $\theta$. However, setting
thresholds for such a test is prohibitively complex.
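The partitioning step can be sketched in a few lines. This is an illustrative sketch only: the function name `upper_partition` and the zero-based indexing of the parameter values are assumptions made here, and ties between equal posterior values are broken by index order.

```python
def upper_partition(posterior, zeta):
    """Split parameter indices into a lower and an upper partition.

    posterior[i] is the a posteriori probability of the i-th parameter value
    given the information up to the previous slot. Returns (n_lower, upper),
    where n_lower plays the role of N_a(k) and upper lists the indices of the
    parameter values that must still be protected by the access rule.
    """
    # Indices sorted by increasing posterior probability (the permutation pi_k).
    order = sorted(range(len(posterior)), key=lambda i: posterior[i])
    cumulative = 0.0
    n_lower = 0
    for idx in order:
        # Grow the lower partition while its total mass stays strictly below zeta.
        if cumulative + posterior[idx] < zeta:
            cumulative += posterior[idx]
            n_lower += 1
        else:
            break
    return n_lower, sorted(order[n_lower:])

# Early on (uniform posterior) nothing can be ignored; later most values can.
print(upper_partition([1.0 / 6.0] * 6, zeta=0.01))
print(upper_partition([0.005, 0.9, 0.003, 0.092], zeta=0.01))
```

Because the posterior values are nonnegative, the cumulative sums are nondecreasing, so stopping at the first violation yields the largest admissible lower partition.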

It is important to note that the access policy given in (33) need not be the optimal access
policy for this problem. Unlike in Section III, here we are allowing the access decision in slot
$k$ to depend on the observations in all slots up to $k$ via the joint beliefs. Hence, it is no longer
obvious that the optimal test should be a threshold test on the LLR of the observations in the
current slot even if parameter $\theta_a$ is known. However, this structure for the access policy can
be justified from the observation that it is simpler to implement in practice than some other policy
that requires us to keep track of all the past observations. The simulation results that we present
in Section V-B also suggest that this scheme achieves substantial improvement in performance
over the worst-case approach, thus further justifying this structure for the access policy.

Under this scheme the new greedy policy for channel selection is to sense the channel that
promises the highest expected instantaneous reward, which is now given by

$$u_k^{\mathrm{gr}} = \arg\max_{a \in \mathcal{C}} \Big\{ \sum_{i=1}^{N} H_{a,i,0}(k) (1 - \epsilon_a(k)) \Big\} \tag{35}$$

where

$$\epsilon_a(k) = P\Big( \bigcup_{\theta \in \Theta_a(k)} \{\mathcal{L}_\theta(Y_a(k)) > \tau_\theta\} \,\Big|\, \{S_a(k) = 0\} \Big).$$

However, in order to prove the convergence of the a posteriori probabilities of the parameters
$\{\theta_a\}$, we need to make a slight modification to this channel selection policy. In our proof,
we require that each channel is sensed frequently. To enforce this condition, we modify the channel
selection policy so that the new channel selection scheme is as follows:

$$u_k = \begin{cases} C_j & \text{if } k \equiv j \bmod CL,\ j \in \mathcal{C} \\ u_k^{\mathrm{gr}} & \text{else} \end{cases} \tag{36}$$

where $C > 1$ is some constant and $\{C_j : 1 \leq j \leq L\}$ is some ordering of the channels in
$\mathcal{C}$.
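The forced-sensing rule in (36) amounts to a sparse round-robin layered over the greedy choice. The sketch below assumes that channels are labelled $1, \ldots, L$, that the ordering is simply $C_j = j$, and that $C$ is an integer; the function name `select_channel` is hypothetical.

```python
def select_channel(k, greedy_choice, num_channels, C):
    """Channel to sense in slot k under the forced-sensing rule.

    Within every block of C * num_channels slots, the slots whose index is
    congruent to j (for j = 1..num_channels) are reserved for channel j;
    every other slot uses the greedy choice.
    """
    period = C * num_channels
    residue = k % period
    if 1 <= residue <= num_channels:
        return residue          # forced visit to channel C_j = j
    return greedy_choice        # unconstrained slot: follow the greedy policy

# With L = 2 and C = 3, slots 1 and 2 of every 6-slot block are forced.
schedule = [select_channel(k, greedy_choice=1, num_channels=2, C=3) for k in range(1, 13)]
print(schedule)
```

Under this rule every channel is sensed at least once in each block of $CL$ slots, which is the property the convergence proof relies on.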



V. Simulation results and comparisons
A. Known distributions 

We consider a simple model for the distributions of the observations and illustrate the
advantage of our proposed scheme over that in [1] by simulating the performances obtained by
employing the greedy algorithm on both these schemes. We also consider a combined scheme
that uses both the channel observations and the ACK signals for updating beliefs.

We simulated the greedy policy under three different schemes. Our scheme, which we call $G_1$,
uses only the observations made on the channels to update the belief vectors. The second one,
$G_2$, uses only the ACK signals transmitted by the secondary receiver, while the third one, $G_3$,
uses both observations as well as the error-free ACK signals. We have performed the simulations
for two different values of the interference constraint $\zeta$. The number of channels was kept at
$L = 2$ in both cases and the transition probability matrix used was

$$P = \begin{bmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{bmatrix}$$

where the first index represents state 0 and the second represents state 1. Both channels were
assumed to have unit bandwidth, $B = 1$, and the discount factor was set to $\alpha = 0.999$. Such
a high value of $\alpha$ was chosen to approximate the problem with no discounts, which would be
the problem of practical interest. As we saw in Section III, the spectrum access problem with
a group of cooperating secondary users is equivalent to that with a single user. Hence, in our
simulations we use a scalar observation model with the following simple distributions for $Y_a(k)$
under the two hypotheses:

$$\begin{aligned} S_a(k) = 0 \text{ (primary OFF)} &: Y_a(k) \sim \mathcal{N}(0, \sigma^2) \\ S_a(k) = 1 \text{ (primary ON)} &: Y_a(k) \sim \mathcal{N}(\mu, \sigma^2) \end{aligned} \tag{37}$$

It is easy to verify that the LLR for these observations is an increasing linear function of $Y_a(k)$.
Hence the new access decisions are made by comparing $Y_a(k)$ to a threshold $\tau$ chosen such that

$$P(\{Y_a(k) < \tau\} \mid \{S_a(k) = 1\}) = \zeta \tag{38}$$

and access decisions are given by

$$\delta_a(k) = \mathcal{I}_{\{Y_a(k) < \tau\}} \mathcal{I}_{\{u_k = a\}}. \tag{39}$$

Fig. 1. Comparison of performances obtained with greedy policy that uses observations and greedy policy that uses ACKs.
Performance obtained with greedy policy that uses both ACKs as well as observations and the upper bound are also shown.
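For the Gaussian model in (37), the threshold defined by (38) has the closed form $\tau = \mu + \sigma \Phi^{-1}(\zeta)$, where $\Phi$ is the standard normal cdf. The sketch below evaluates it with Python's standard library; the helper names are hypothetical, and the conversion from SNR in dB to $\mu$ assumes the amplitude convention $\mathrm{SNR} = 20 \log_{10}(\mu/\sigma)$.

```python
from statistics import NormalDist

def access_threshold(mu, sigma, zeta):
    """Threshold tau with P(Y < tau | S=1) = zeta when Y ~ N(mu, sigma^2)."""
    return NormalDist(mu, sigma).inv_cdf(zeta)

def missed_opportunity_prob(tau, sigma):
    """P(Y >= tau | S=0) for Y ~ N(0, sigma^2): channel idle but access denied."""
    return 1.0 - NormalDist(0.0, sigma).cdf(tau)

sigma = 1.0
for snr_db in (-5, 0, 5):
    mu = sigma * 10 ** (snr_db / 20.0)        # assumed SNR convention (see lead-in)
    for zeta in (0.1, 0.01):
        tau = access_threshold(mu, sigma, zeta)
        print(snr_db, zeta, round(tau, 3), round(missed_opportunity_prob(tau, sigma), 3))
```

Tightening the interference constraint lowers the threshold, so an idle channel is declared busy more often; a larger mean under the primary-ON hypothesis has the opposite effect.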



The belief updates in (8) are now given by

$$p_a(k) = \frac{q_a(k)\, f(\mu, \sigma^2, Y_a(k))}{q_a(k)\, f(\mu, \sigma^2, Y_a(k)) + (1 - q_a(k))\, f(0, \sigma^2, Y_a(k))}$$

when channel $a$ was selected in slot $k$ (i.e., $u_k = a$), and $p_a(k) = q_a(k)$ otherwise. Here
$q_a(k)$ is given by (7) and $f(x, y, z)$ represents the value of the Gaussian density function with
mean $x$ and variance $y$ evaluated at $z$. For the mean and variance parameters in (37) we use
$\sigma = 1$ and choose $\mu$ so that $\mathrm{SNR} = 10 \log_{10}(\mu^2 / \sigma^2)$ takes values
from $-5$ dB to $5$ dB. In the case of cooperative sensing, this SNR represents the effective
signal-to-noise ratio in the joint LLR statistic at the decision center,
$\mathcal{L}(\underline{X}_a(k))$. We perform simulations for two values of the interference
constraint, $\zeta = 0.1$ and $\zeta = 0.01$.
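Dividing the numerator and denominator of the Gaussian belief update by $q_a(k) f(\mu, \sigma^2, Y_a(k))$ gives an equivalent form in terms of the LLR, which makes explicit the monotonicity invoked in the proof of Lemma 4.1. This is only a rearrangement of the same expression, assuming $0 < q_a(k) < 1$:

```latex
\begin{align*}
\mathcal{L}(y) &= \log\frac{f(\mu,\sigma^2,y)}{f(0,\sigma^2,y)}
               = \frac{\mu y - \mu^2/2}{\sigma^2}, \\
p_a(k) &= \frac{1}{1 + \dfrac{1 - q_a(k)}{q_a(k)}\, e^{-\mathcal{L}(Y_a(k))}} .
\end{align*}
```

Since $e^{-\mathcal{L}}$ is decreasing in $\mathcal{L}$, the updated belief increases with the LLR, and for $\mu > 0$ it therefore increases with the observation $Y_a(k)$ itself.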

As seen in Fig. 1, the strategy of using only ACK signals ($G_2$) performs worse than the one
that uses all the observations ($G_1$), especially for $\zeta = 0.01$, thus demonstrating that relying
only on ACK signals limits the amount of information that can be learned. We also
observe that the greedy policy attains a performance that is within 10% of the upper bound. It
is also seen in the figure that the reward values obtained under $G_1$ and $G_3$ are almost equal.
For $\zeta = 0.01$, the two curves overlap. This observation suggests that
the extra advantage obtained by incorporating the ACK signals is insignificant, especially when
the interference constraint is low. The explanation for this observation is that the ACK signals
are received only when the signal transmitted by the secondary transmitter successfully gets
across to its receiver. For this to happen the state of the primary channel should be '0' and the
secondary must decide to access the channel. When the value of the interference constraint $\zeta$
is low, the secondary accesses the channel only when the value of $Y_a(k)$ is low. Hence the
observations in this case carry a significant amount of information about the states themselves
and the additional information that can be obtained from the ACK signals is not significant.
Thus learning using only observations is almost as good as learning using both observations
and ACK signals in this case.

B. Unknown distributions 

We compare the performances of the two different approaches to the spectrum access problem
with unknown distributions that we discussed in Section IV. We use a parameterized version
of the observation model we used in the example in Section V-A. We assume that the primary
and secondary users are stationary and that the secondary user is unaware of its location
relative to the primary transmitter. We assume that the secondary user employs some form of
energy detection, which means that the lack of knowledge about the location of the primary
manifests itself in the form of an unknown mean power of the signal from the primary. Using
Gaussian distributions as in Section V-A, we model the lack of knowledge of the received
primary power by assuming that the mean of the observation under $H_1$ on channel $a$ is an
unknown parameter $\theta_a$ taking values in a finite set of positive values $\Theta$. The new
hypotheses are:

$$\begin{aligned} S_a(k) = 0 &: Y_a(k) = N_a(k) \\ S_a(k) = 1 &: Y_a(k) = \theta_a + N_a(k) \end{aligned}$$

$$\text{where } N_a(k) \sim \mathcal{N}(0, \sigma^2),\ \theta_a \in \Theta,\ \min(\Theta) > 0. \tag{40}$$

For the set of parameterized distributions in (40), the log-likelihood ratio function
$\mathcal{L}_\theta(x)$ defined in (25) is linear in $x$ for all $\theta \in \Theta$. Hence comparing
$\mathcal{L}_\theta(Y_a(k))$ to a threshold is equivalent to comparing $Y_a(k)$ to some other
threshold. Furthermore, for this set of parameterized distributions, it is easy to see that the
conditional cumulative distribution function (cdf) of the observations $Y_a(k)$ under $H_1$,
conditioned on $\theta_a$ taking value $\theta$, is monotonically decreasing in $\theta$. Under the
assumption that $\min \Theta > 0$, it follows that choosing $\theta^* = \min \Theta$ satisfies
the conditions of (28). Hence the optimal access decision under the worst-case approach given
in (30) can be written as

$$\delta_a(k) = \mathcal{I}_{\{u_k = a\}} \mathcal{I}_{\{Y_a(k) < \tau_w\}} \tag{41}$$

where $\tau_w$ satisfies

$$P(\{Y_a(k) < \tau_w\} \mid \{S_a(k) = 1, \theta_a = \theta^*\}) = \zeta \tag{42}$$

where $\theta^* = \min \Theta$. Thus the worst-case solution for this set of parameterized
distributions is identical to that obtained for the problem with known distributions described in
(37) with $\mu$ replaced by $\theta^*$: the structures of the access policy, the channel selection
policy, and the belief update equations are identical to those derived in the example in Section V-A.

Similarly, the access policy for the case of random $\theta_a$ parameters given in (33) can now be
written as

$$\delta_a(k) = \mathcal{I}_{\{u_k = a\}} \mathcal{I}_{\{Y_a(k) < \tau_r(k)\}} \tag{43}$$

where $\tau_r(k)$ satisfies

$$P(\{Y_a(k) < \tau_r(k)\} \mid \{S_a(k) = 1, \theta_a = \theta^*(k)\}) = \zeta \tag{44}$$

where $\theta^*(k) = \min \Theta_a(k)$. The belief updates and greedy channel selection are performed
as described in Section IV-B. The quantity $\epsilon_a(k)$ appearing in (35) can now be written as

$$\epsilon_a(k) = P(\{Y_a(k) > \tau_r(k)\} \mid \{S_a(k) = 0\}).$$
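To illustrate how the learned threshold $\tau_r(k)$ relaxes as low-mean hypotheses are ruled out, the following sketch computes it from the means currently in the upper partition, together with the resulting $\epsilon_a(k)$. The helper names and the example partitions are hypothetical; the Gaussian model of (40) is assumed.

```python
from statistics import NormalDist

def learning_threshold(upper_means, sigma, zeta):
    """tau_r(k): threshold meeting the constraint for the smallest mean still in play."""
    theta_min = min(upper_means)              # theta*(k) = min of the upper partition
    return NormalDist(theta_min, sigma).inv_cdf(zeta)

def epsilon(tau, sigma):
    """epsilon_a(k) = P(Y > tau | S=0) for Y ~ N(0, sigma^2)."""
    return 1.0 - NormalDist(0.0, sigma).cdf(tau)

sigma, zeta = 1.0, 0.01
early = [0.5, 1.0, 2.0]   # nothing ruled out yet: same threshold as the worst case
late = [2.0]              # posterior has concentrated on the largest mean
for means in (early, late):
    tau = learning_threshold(means, sigma, zeta)
    print(means, round(tau, 3), round(epsilon(tau, sigma), 3))
```

With all parameter values still in the upper partition the rule coincides with the worst-case threshold; once the small means drop out, the threshold rises and fewer idle slots are wasted.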

We simulated the performances of both schemes on the hypotheses described in (40). We
used the same values of $L$, $P$, $\alpha$ and $\sigma$ as in Section V-A. We chose set $\Theta$ such
that the SNR values in dB given by $20 \log_{10}(\mu_i / \sigma)$ belong to the set
$\{-5, -3, -1, 1, 3, 5\}$. The prior probability distribution for $\theta_a$ was chosen to be the
uniform distribution on $\Theta$. The interference constraint $\zeta$ was set to 0.01. Both channels
were assumed to have the same value of true SNR in each simulation. The reward was computed over
10000 slots since the remaining slots do not contribute significantly to the reward. The value of
$C$ in (36) was set to a value higher than the number of slots considered so that the greedy channel
selection policy always uses the second alternative in (36). Although we require (36) for our proof
of convergence of the a posteriori probabilities in the Appendix, it was observed in simulations that
this condition was not necessary for convergence.
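The simulation setup above can be made concrete with a short calculation: the means in $\Theta$ follow from inverting the SNR definition, and the weight of the discarded slots follows from the geometric series. The variable names are illustrative only.

```python
sigma = 1.0
snr_db = [-5, -3, -1, 1, 3, 5]
# SNR (dB) = 20*log10(mu/sigma)  =>  mu = sigma * 10**(SNR/20)
theta_set = [sigma * 10 ** (s / 20.0) for s in snr_db]

alpha = 0.999
horizon = 10000
# Fraction of the total discounted weight sum_{k>=0} alpha^k that falls on
# slots k >= horizon; the geometric series gives alpha**horizon.
tail_fraction = alpha ** horizon

print([round(m, 4) for m in theta_set])
print(tail_fraction)
```

The tail fraction is a few parts in $10^5$, which is consistent with truncating the reward computation at 10000 slots.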

The results of the simulations are given in Fig. 2. The net reward values obtained under the
worst-case design of Section IV-A and those obtained with the algorithm for learning $\theta_a$ given
in Section IV-B are plotted. We have also included the rewards obtained with the greedy algorithm
$G_1$ with known $\theta_a$ values; these values denote the best rewards that can be obtained with the
greedy policy when the parameters $\theta_a$ are known exactly. Clearly, we see that the worst-case
design gives us almost no improvement in performance for high values of actual SNR. This is
because the threshold we choose is too conservative for high SNR scenarios, leading to several
missed opportunities for transmitting. The minimal improvement in performance at high SNR is
due to the fact that the system now has more accurate estimates of the channel beliefs although
the update equations were designed for a lower SNR level. The learning scheme, on the other
hand, yields a significant performance advantage over the worst-case scheme for high SNR values,
as seen in the figure. It is also seen that there is a significant gap between the performance with
learning and that with known $\theta_a$ values at high SNR values. This gap is due to the fact that
the a posteriori probabilities of the $\theta_a$ parameters take some time to converge. As a result of
this delay in convergence a conservative access threshold has to be used in the initial slots, thus
leading to a drop in the discounted infinite horizon reward. However, if we were using an average
reward formulation for the dynamic program rather than a discounted reward formulation, we
would expect the two curves to overlap since the loss in the initial slots is insignificant when
computing the long-term average reward.

Remark 5.1: So far in this paper, we have assumed that the cardinality of set $\Theta$ is finite.
The proposed learning algorithm can also be adapted for the case when $\Theta$ is a compact set. A
simple example illustrates how this may be done. Assuming parameterized distributions of the
form described in (40), suppose that the value of $\theta_a$ in dB is uniformly distributed in the
interval $[-4.5, 4.5]$ and that we compute the a posteriori probabilities of $\theta_a$ assuming that
its value in dB is quantized to the finite set $\Theta = \{-5, -4, \ldots, 5\}$. Now if the actual
realization of $\theta_a$ is between 1 dB and 2 dB, say 1.5 dB, then we expect to see low a posteriori
probabilities for all elements of $\Theta$ except 1 dB and 2 dB, and in this case it would be safe to
set the access threshold assuming an SNR of 1 dB. Although this threshold is not the best that can be
set for the actual realization of $\theta_a$, it is still a significant improvement over the
worst-case threshold, which would correspond to an SNR of $-4.5$ dB. We expect the a posteriori
probabilities of all elements of $\Theta$ other than 1 dB and 2 dB to converge to 0, but the
a posteriori probabilities of these two values may not converge; they may oscillate between 0 and 1
such that their sum converges to 1. A rigorous version of the above argument would require some
ordering of the parameterized distributions as in (28).

Fig. 2. Comparison of performances obtained with worst-case approach and learning approach. Performance obtained with
greedy policy is also shown. The value of $\zeta = 0.01$.

VI. Conclusions and discussion

The results of Section V-A and the arguments we presented in Section III-C clearly show that
our scheme of estimating the channel occupancies using the observations yields performance
gains and may have practical advantages over the ACK-based scheme that was proposed in [1].
We believe that these advantages are significant enough to justify using our scheme even though
it necessitates the use of dedicated control channels for synchronization.

For the scenario where the distributions of the received signals from the primary transmitters
are unknown and belong to a parameterized family, the simulation results in Section V-B suggest
that designing for worst-case values of the unknown parameters can lead to a significant drop in
performance relative to the scenario where the distributions are known. Our proposed
learning-based scheme overcomes this performance drop by learning the primary signal's statistics.
The caveat is that the learning procedure requires a reliable model for the state transition process
if we need to give probabilistic guarantees of the form (34) and to ensure convergence of the
beliefs about the $\theta_a$ parameters.

In most of the existing literature on sensing and access policies for cognitive radios that
use energy detectors, the typical practice is to consider a worst-case mean power under the
primary-present hypothesis. The reasoning behind this approach is that the cognitive users have to
guarantee that the probability of interfering with any primary receiver located within a protected
region [10], [9] around the primary transmitter is below the interference constraint. Hence it is
natural to assume that the mean power of the primary signal is the worst-case one, i.e., the mean
power that one would expect at the edge of the protected region. However, the problem with this
approach is that by designing for the worst-case distribution, the secondary users are forced to
set conservative thresholds while making access decisions. Hence even when the secondary users
are close to the primary transmitter and the SNR of the signal they receive from the primary
transmitter is high, they cannot efficiently detect vacancies in the primary spectrum. If instead
they were aware that they were close to the transmitter, they could detect spectral
vacancies more efficiently, as demonstrated by the improvement in performance at higher SNRs
observed in the simulation example in Section V-A. This loss in performance is overcome by the
learning scheme proposed in Section IV-B. By learning the value of $\theta_a$ the secondary users can
set more liberal thresholds and hence exploit vacancies in the primary spectrum better when
they are located close to the primary transmitter. Thus, using such a scheme would produce a
significant improvement in the overall throughput of the cognitive radio system.

Appendix 

Here we show that for each channel $a$, the a posteriori probability mass function of parameter
$\theta_a$ converges to a delta function at the true value of the parameter almost surely under the
algorithm described in Section IV-B.

Theorem A.1: Assume that the transition probability matrix $P$ satisfies (16). Further assume
that the conditional densities of the observations given in (24) satisfy

$$\int f_{\mu_i}(y)\, |\log(f_{\mu_i}(y))|\, dy < \infty \text{ for all } \mu_i \in \Theta, \tag{45}$$

and that all densities in (24) are distinct. Then, under the channel sensing scheme that was
introduced in (36), for each channel $a$,

$$P(\{\theta_a = \mu_i\} \mid Y^k) \xrightarrow[k \to \infty]{\text{a.s.}} \mathcal{I}_{\{\theta_a = \mu_i\}}, \text{ for all } \mu_i \in \Theta.$$

Proof: Without loss of generality, we can restrict ourselves to the proof of the convergence
of the a posteriori distribution of $\theta_1$, the parameter for the first channel. By the modified
sensing scheme introduced in (36), it can be seen that channel 1 is sensed at least every $CL$ slots.
Hence, if the a posteriori distribution converges for an algorithm that senses channel 1 exactly every
$CL$ slots, it should converge even for our algorithm, since our algorithm updates the a posteriori
probabilities more frequently. Furthermore, by considering a $CL$-times undersampled version
of the Markov chain that determines the evolution of channel 1, without loss of generality, it is
sufficient to show convergence for a sensing policy in which channel 1 is sensed in every slot.
Since condition (16) holds for the original Markov chain, it also holds for the
undersampled version. So now we assume that an observation $Y_k$ is made on channel 1 in every
slot $k$. We use $Y^k$ to represent all observations on channel 1 up to slot $k$.

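We use $\mu_{i^*} \in \Theta$ to represent the true realization of random variable $\theta_1$ with
$i^* \in \{1, \ldots, N\}$, and $\pi$ to denote the prior distribution of $\theta_1$. The a posteriori
probability mass function of $\theta_1$ evaluated at $\mu_i$ after $n$ time slots can be expressed as

$$P(\{\theta_1 = \mu_i\} \mid Y^n) = \frac{\pi(\mu_i)\, P_i(Y^n)}{\sum_{j=1}^{N} \pi(\mu_j)\, P_j(Y^n)} \tag{46}$$

where we use the notation $P_i(\cdot)$ to denote the distribution of the observations conditioned on
$\theta_1$ taking the value $\mu_i \in \Theta$. It follows from [12, Theorem 1, Theorem 2, and Lemma 6]
that conditioned on $\{\theta_1 = \mu_{i^*}\}$ we have

$$\frac{P_{i^*}(Y^n)}{P_i(Y^n)} \xrightarrow[n \to \infty]{\text{a.s.}} \infty \text{ for all } i \neq i^*.$$

Hence, it follows from (46) that conditioned on $\{\theta_1 = \mu_{i^*}\}$ we have

$$P(\{\theta_1 = \mu_i\} \mid Y^n) \xrightarrow[n \to \infty]{\text{a.s.}} 0 \text{ for all } i \neq i^*,$$

which further implies that conditioned on $\{\theta_1 = \mu_{i^*}\}$ we have

$$P(\{\theta_1 = \mu_{i^*}\} \mid Y^n) \xrightarrow[n \to \infty]{\text{a.s.}} 1.$$

Since this holds for all possible realizations $\mu_{i^*} \in \Theta$ of $\theta_1$, the result follows. ■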
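The convergence stated in Theorem A.1 can be observed numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it simulates a single channel that is sensed in every slot, uses the Gaussian model of (40) with a hypothetical three-element parameter set, takes the transition matrix from Section V-A with `P[l][j]` read as the probability of moving from state `l` to state `j`, and tracks the joint beliefs with the prediction and update recursion of Section IV-B. The function name `simulate_posterior` is an assumption.

```python
import random
from statistics import NormalDist

def simulate_posterior(true_idx, mus, P, sigma, n_slots, seed=0):
    """Posterior over the candidate means after n_slots observations of one channel."""
    rng = random.Random(seed)
    N = len(mus)
    pi1 = P[0][1] / (P[0][1] + P[1][0])       # stationary probability of state 1
    # Initial joint beliefs: uniform prior on mus times the stationary distribution.
    Q = [[(1.0 - pi1) / N, pi1 / N] for _ in range(N)]
    state = 1 if rng.random() < pi1 else 0
    for _ in range(n_slots):
        # Channel state evolves as a two-state Markov chain.
        state = 1 if rng.random() < P[state][1] else 0
        y = rng.gauss(mus[true_idx] * state, sigma)
        # Prediction step: H[i][j] = sum_l P[l][j] * Q[i][l].
        H = [[sum(P[l][j] * Q[i][l] for l in (0, 1)) for j in (0, 1)] for i in range(N)]
        # Update step with the new observation, followed by normalization.
        f0 = NormalDist(0.0, sigma).pdf(y)
        U = [[H[i][0] * f0, H[i][1] * NormalDist(mus[i], sigma).pdf(y)] for i in range(N)]
        total = sum(sum(row) for row in U)
        Q = [[v / total for v in row] for row in U]
    return [sum(row) for row in Q]            # marginal posterior over mus

P = [[0.9, 0.1], [0.2, 0.8]]
posterior = simulate_posterior(true_idx=2, mus=[0.5, 1.0, 2.0], P=P, sigma=1.0, n_slots=3000)
print([round(p, 6) for p in posterior])
```

In runs of this kind the posterior mass concentrates on the true parameter value, in line with the delta-function limit asserted by the theorem.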
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